Genotype-Phenotype Taxonomy of Hypertrophic Cardiomyopathy

BACKGROUND: Hypertrophic cardiomyopathy (HCM) is an important cause of sudden cardiac death associated with heterogeneous phenotypes, but there is no systematic framework for classifying morphology or assessing associated risks. Here, we quantitatively survey genotype-phenotype associations in HCM to derive a data-driven taxonomy of disease expression. METHODS: We enrolled 436 patients with HCM (median age, 60 years; 28.8% women) with clinical, genetic, and imaging data. An independent cohort of 60 patients with HCM from Singapore (median age, 59 years; 11% women) and a reference population from the UK Biobank (n=16 691; mean age, 55 years; 52.5% women) were also recruited. We used machine learning to analyze the 3-dimensional structure of the left ventricle from cardiac magnetic resonance imaging and build a tree-based classification of HCM phenotypes. Genotype and mortality risk distributions were projected on the tree. RESULTS: Carriers of pathogenic or likely pathogenic variants for HCM had lower left ventricular mass, but greater basal septal hypertrophy, with reduced life span (mean follow-up, 9.9 years) compared with genotype negative individuals (hazard ratio, 2.66 [95% CI, 1.42–4.96]; P<0.002). Four main phenotypic branches were identified using unsupervised learning of 3-dimensional shape: (1) nonsarcomeric hypertrophy with coexisting hypertension; (2) diffuse and basal asymmetrical hypertrophy associated with outflow tract obstruction; (3) isolated basal hypertrophy; and (4) milder nonobstructive hypertrophy enriched for familial sarcomeric HCM (odds ratio for pathogenic or likely pathogenic variants, 2.18 [95% CI, 1.93–2.28]; P=0.0001). Polygenic risk for HCM was also associated with different patterns and degrees of disease expression. The model was generalizable to an independent cohort (trustworthiness, M1: 0.86–0.88). CONCLUSIONS: We report a data-driven taxonomy of HCM for identifying groups of patients with similar morphology while preserving a continuum of disease severity, genetic risk, and outcomes. This approach will be of value in understanding the causes and consequences of disease diversity.


UK Biobank participants
The UKB study recruited 500,000 participants aged 40 to 69 years old from across the United Kingdom between 2006 and 2010 (National Research Ethics Service, 11/NW/0382). 36This study was conducted under terms of access approval number 40616.In each case, written informed consent was provided.
A sub-study of UKB invited participants for CMR for assessment of cardiac chamber volumes and function using a standard protocol (1.5T, Siemens Aera, Siemens Medical Systems, Erlangen, Germany). 37As a reference population, we selected 16,691 participants that did not meet criteria for left ventricular hypertrophy and were classified as genotype negative (SARC-NEG) by having no variants in genes that may cause or mimic HCM (see Sequencing and variant categorisation).

Cardiac phenotyping using machine learning
Segmentation of the cine images in both UKB and HCM groups was performed using a deep learning neural network algorithm developed and optimised in-house.The performance of image annotation using this algorithm is equivalent to a consensus of expert human readers and achieves sub-pixel accuracy for cardiac segmentation. 38The label maps were super-resolved and registered to a cardiac atlas enabling consistent quantitative three-dimensional phenotypic analysis within and between patient groups. 39yocardial wall thickness was measured along radial line segments connecting the endocardial and epicardial surfaces perpendicular to the myocardial centreline and excluding trabeculae.Chamber volumes and mass were calculated from the segmentations according to standard post-processing guidelines. 40Myocardial strain analysis was performed using non-rigid free-form deformation image registration. 41Trabecular traits were quantified using fractal dimension (FD) analysis where a higher value indicates more complex trabeculation. 42

Sequencing and variant categorisation
Panel sequencing was completed in the HCM patient cohort, as previously described. 43The patients were sequenced using either a custom SureSelect capture panel targeting genes associated with inherited cardiac conditions or the Illumina TruSight Cardio panel.Sequencing was performed on either the SOLiD 5500xl platform, or the Illumina HiSeq, MiSeq or NextSeq platforms.The Singaporean cohort underwent targeted genetic sequencing using the TruSight Cardio panel and an equivalent pipeline as previously reported. 44atients were divided into three genetic strata.Patients carrying at least one potentially-causative rare variant (allele frequency <0.00004) 45 in any of 8 sarcomere-encoding genes robustly associated with HCM were considered genotype positive.These were further stratified into (i) those carrying variants previously confidently classified as pathogenic / likely pathogenic (SARC-P/LP) in ClinVar, and confirmed on our review, or else curated as P/LP according to ACMG criteria using the semi-automated CardioClassifier decision support tool 46 (n = 107), as previously published, 5 ; and (ii) those carrying sarcomeric variants of uncertain significance (SARC-VUS), comprising variants in the same 8 genes, that are consistent with known disease mechanisms and sufficiently rare, but with insufficient evidence to classify robustly as P/LP. 47,48ndividuals were classified as genotype negative (SARC-NEG) if they had no rare protein-altering variant (minor allele frequency <0.001 in the UKB and the Genome Aggregation Database) 49 in any of 25 genes that potentially cause HCM (definitive or moderate evidence according to international curation) 50 or cause syndromes that can present with isolated left ventricular hypertrophy (genocopies). 50In order to generate the most robust set of true genotype negatives, individuals carrying a protein-altering variant in any these 25 genes, but that was not sufficiently rare to be considered potentially causative of monogenic HCM was excluded from the analysis.Further details are given in Supplementary Materials.Common genetic variation contributes substantially to HCM risk and we also assessed the relationship between phenotype and polygenic score (PGS) derived from a case-control HCM genome wide association study (GWAS) in the 100,000 Genomes Project. 32
Splice region variants (outside the canonical splice donor and acceptor sites) were assessed in two ways; i) via ClinVar report: splice region variants found pathogenic with at least 2 star evidence for HCM in ClinVar and reported functional evidence for splicing were termed "splice confirmed"; if the functional evidence was unclear the protein consequence remained unchanged; if there was functional evidence of an alternative mechanism to splicing, the protein consequence was renamed (e.g.missense variant); ii) via prediction threshold: of the splice region variants, they were excluded if they did not meet the spliceAI threshold of >0.8, and these thresholds were used to identify potentially splice-causing variants of those splice region variants identified with a non-synonymous consequence flag (e.g.intron variant).
The pipeline then consisted of three main filtering steps which resulted in an output of four columns of binary code flagging genotype status (heterozygous, compound heterozygotes, and homozygotes, combined) as "1": SARC-NEG -Individuals who do not harbor any rare non-synonymous variants in any of the 25 genes of interest.This was a stringent filter to identify an unambiguous genotype-negative control group.
SARC-VUS -Individuals harboring rare variants in one or more of the 8 definitive HCM-associated sarcomere-encoding genes.Rare variants were restricted to known disease-associated variant classes.This step separated the variants into two subsets: i) Loss of function (LoF) alleles (group A), which contained only the gene MYBPC3, and filters for the protein consequences of stop gained, splice acceptor variant, splice donor variant, frameshift variant, and splice region variant (with additional in silico evidence of an effect on splicing).LOFTEE was incorporated in this step to exclude loss of function (LoF) variants that were flagged as "low confidence" (LC) and other LOFTEE flags, such as "NAGNAG sit" requiring reannotation to non-LoF variant status; ii) Protein altering (PAV) alleles (group B), which included all 8 sarcomeric genes, including MYBPC3, and filters for the protein consequences of missense variant, inframe insertion, and inframe deletion.
Both groups included additional positional annotation (LoF variants found in the last exon or 55bp into the penultimate exon or stop gained variants with a NMD flag using the NMD plugin), this included variants that introduce a proteintruncating variant (PTC) and predicted to lead to nonsense-mediated decay (NMD).The variants flagged 'coding sequence variant' and 'protein altering variant' were manually curated, as were 'stop lost' and 'start lost' which were examined via EN-SEMBL sequence and UCSC Genome Browser to identify in-frame rescues nearby.To be included in the SARC VUS group, the variants were required to meet a maximum gnomAD filter allele frequency (FAF) threshold for HCM (<0.00004) and excluded variants deemed P/LP for DCM on ClinVar.
SARC-P/LP is as SARC-VUS, plus annotated as P/LP according to the ACMG guidelines. 48Variants were reviewed if reported as P/LP for HCM by at least one submitter in ClinVar, or if flagged as P/LP by the CardioClassifier decision support software. 46Variants that did not meet either of these criteria were not individually reviewed.

Outcome measures
Data were collected to measure all-cause mortality in the HCM cohort.Outcomes were verified through search of the NHS Shared Care Records.Patients were followed up for a median of 10.2 years from date of study enrollment.

Statistical analysis and data modelling
Statistical analysis was performed with R (version 4.0.3) and RStudio Server (version 1.2; Boston, MA), unless otherwise stated.Variables were expressed as percentages if categorical, mean ± standard deviation (SD) if continuous and normal, and median ± inter-quartile range (IQR) if continuous and non-normal.Baseline anthropometric data were compared by Kruskal-Wallis tests and, if differences were identified, a Wilcoxon test was used for pairwise comparisons with Benjamini-Hochberg adjustment for multiple testing.Clustering of clinical data from HCM patients was performed using UMAP (uniform manifold approximation and projection) 54 for dimensionality reduction followed by a K-means algorithm evaluated with a silhouette score to find the optimal number of clusters.
The association between genotype and three-dimensional phenotype was assessed by fitting univariable regression models at each vertex of the cardiac mesh, controlling for false discovery, with corrected beta coefficients plotted on the epicardial surface. 5,55Clustering of subjects by their 3D left ventricular wall thickness, adjusted for age, sex and ancestry, was performed through partitioning the shared nearest neighbour (SNN) graph 56 with the multilevel refinement Leiden algorithm. 57The SNN graph and its partitions were determined using the functions available in Seurat. 58The clusters were visually inspected by UMAP projection and their stability was assessed through bootstrapping using fpc.We used DDRTree (discriminative dimensionality reduction via learning a tree) to project the 3D left ventricular wall thickness into a 2D tree structure to visualise the distribution of HCM phenotypes. 12,59he predictive power of the DDRTree mapping for P/LP genotype was tested with a generalized additive model (GAM) fitted on the tree coordinates, using a 10-fold cross validation repeated 3 times.Survival probability to median observed age was estimated with a Cox proportional hazards model fitted on the cubic spline of tree coordinates of each individual and their relative position in the tree.Association between the DDRTree mapping and polygenic risk score was assessed with a logistic regression GAM model using the subjects' coordinates as independent variables and the binarized polygenic risk score (thresholded at median) as outcome.The predictions were obtained using a 10-fold cross validation.
The primary survival analysis was performed in individuals from the HCM cohort with chronological age as time-scale and adjusting for genetic sex and ancestry (dichotomised by white European ancestry).Participants with SARC-P/LP variants were compared with pooled participants without variants and with SARC-VUS carriers.Proportional hazards assumption as assessed using Schoenfeld residuals was not violated.

Estimation of wall thickness from meshes
Three dimensional meshes with the local wall thicknesses (WT) were generated from the myocardial segmentations of the end systolic (ES) and end diastolic (ED) phases, using previously published methods. 55,60,61In order to make the modelling of the WT computationally tractable, the meshes were decimated by 99%.Specifically, because of the one-to-one correspondence between subjects' meshes vertices and atlas vertices, the decimation was applied to the atlas only, and the closest resulting vertices to the original atlas were selected from all meshes.This allowed us to preserve the correspondence between the individual vertices across all meshes.

Unsupervised analysis of wall thickness values
Wall thicknesses were firstly adjusted by age at cardiac magnetic resonance (CMR) imaging, sex and ancestry using a linear regression, and normalized afterwards using the Seurat package for R. 58 To remove the intrinsic correlations between the values, due to the spatial nature of the data, and to reduce the effect of statistical noise, the data was compressed applying a principal components analysis (PCA), and retrieving the first 50 principal components.
A shared nearest neighbor graph (SNN) was built from the compressed data, where the nodes corresponded to the subjects, and the edges corresponded to the Jaccard index between the nearest neighbours of each pair of subjects.The nearest neighbors corresponded to the 20 subjects with the smallest cosine distance from each subject.Thus, the SNN graph was partitioned using the multilevel Louvain algorithm, with resolutions varying from 0.1 to 1 in steps of 0.1.Both SNN and its partitioning were performed using Seurat package for R. 58 The optimal resolution was chosen by inspecting the cluster stability with clustree, corresponding to the partitioning before any mixing of cluster assignment was visible 62 (Supplementary Figs.II, III).Additionally, stability of the clusters was assessed by rerunning the clustering on 1000 random subsets, each with a size equal to 80% of the whole cohort, using the function clusterboot from the fpc package for R.
If more than one partitioning was found corresponding to the same branch structure of the clustree plot, that with the greater stability was chosen (Supplementary Tables I, II).

DDTree modelling of wall thickness values
DDRTree was applied on the adjusted wall thicknesses for age at CMR, sex and ancestry.Following the same approach described in the previous section, the first 50 principal components were calculated and used as input for the DDRTree algorithm. 19All parameters were kept as default.The underlying tree structure, the result of the procedure, was automatically partitioned into branches by considering as branching those points with a degree equal to 3. Subsequently, we merged small branches into their larger neighbour, such that the main structure of the tree was preserved.This step aimed at reducing the number of phenotypic sub-types and increasing the statistical power of the subsequent modelling.Firstly, branches consisting of less than 5 individuals were merged to the closest connected branch.Subsequently, short terminal branches, with a geodesic length smaller than 5% of the graph length were merged with their parent with respect to the center of the tree, as they represented low-diversity individuals.The branches connected to these were then merged, as bifurcation no longer existed (Supplementary Fig. VIII-IX).
A set of clinical and anthropometric measures were tested for statistical association with the tree branches.We first applied a Kruskal Wallis test to continuous variables and  2 test to discrete variables.Those variables showing a significant association (Benjamini-Hochberg adjusted P < 0.05) were post-hoc tested.A Dunn test was used to test each pair of branches, whereas the exact Fisher test was used for discrete variables in one-vs-all fashion.All P-values were adjusted for multiple testing using Benjamini-Hochberg method.
Finally, we tested the consistency of the tree structure after removing 8 related subjects from our original analysis.The association between the tree's main axes and 1) genotype status, and 2) PRS remained unchanged (Supplementary Table V-VI).Additionally, the relative position of the subjects was consistent between the two trees, as shown by the high value of the correlation between the pseudotime of the two models (Spearman's rho = 0.89, Supplementary Fig. X).

External validation
Wall thicknesses were residualized by linear regression using sex and age at scan as covariates.In this way, we could compare the similarities between the intra-cohort statistical properties of WT values.The first 5 principal components were estimated from the development cohort and used to project the adjusted Singaporean WT values.The development cohort principal component scores were used to fit two random forest models aimed at predicting the x and y tree coordinates.
Performances of the models were evaluated using a 10-fold cross-validation, repeated 3 times.The fitted models were therefore used to predict the tree coordinates of the Singaporean individuals, from their principal component scores.Faithfulness of the tree mapping for the Singaporean cohort was evaluated by the "trustworthiness" M 1 measure, which estimates how observations that are similar in the original high dimensional space are placed close to each other in the low dimensional space.It ranges from 0 to 1, with larger values corresponding to a better representative low dimensional mapping. 13n order to evaluate the consistency of the local statistical patterns between the development tree and the projected Singaporean individuals, we considered Spearman's correlation between nearest neighbour points in the tree.We estimated the correlations between the adjusted WT of the closest points in the development tree, and compared their distribution with that of the correlations between the Singaporean WT and their closest points in the development tree.In order to test differences in the distributions, we used a Wilcoxon test for difference of medians, and a Kolmogorov Smirnov test for difference of distributions.

Supplementary Tables I-XI and Figures I-XI
Supplementary Table I.Cluster stability at end diastole.Results from the analysis of the cluster stability for end diastolic wall thickness (ED WT) in 1000 subsets.The possible values range between 0 and 1, where 0 means that the cluster is not stable, and 1 that the clusters are identical in all repetitions.Between a resolution of 0.4 and 0.5, determined from the clustree plot (Supplementary Fig. II).We chose the resolution of 0.5 because it is also characterized by more stable clusters.Cluster 2 is found as an intermediate state between cluster 0 and 1, and it is less stable than the other two.The optimal resolution for the Louvain partitioning is found by inspecting the clustree plot (c.).In this case, the value of 0.5 was chosen, corresponding to the resolution with stable branching before any assignment mixing (diagonal arrows), and the largest bootstrapping stability (Supplementary Table I).The UMAP projections in a. and b. show the parallelism between the clusters and the genotypes.d.Genotype proportions by cluster.The optimal resolution for the Louvain partitioning is found by inspecting the clustree plot (c.).In this case, the value of 0.1 was chosen, corresponding to the resolution with stable branching before any assignment mixing (diagonal arrows), and the largest bootstrapping stability (Supplementary Table II).The UMAP projections in a. and b. show the parallelism between the clusters and the genotypes.d.Genotype proportions by cluster.The pseudotime calculated from the whole cohort tree (A) and that without unrelated subjects (B) shows that the subjects relative positions are highly correlated (Spearman's rho = 0.89) (C).This suggests that the tree generated after removing the related subjects was consistent with the one from the whole cohort.

Dimensionality reduction and unsupervised clustering of clinical features
Participant features comprised demographic data, clinical characteristics, CMR and echocardiographic measurements, and reported interventions and medicines (Supplementary Table XI).Missing values were inferred with the mice package for R. 63 Numerical features were converted to categorical variables by clustering groups of values into bins with a K-means algorithm. 64All categorical variables were then transformed into binary variables with one-hot encoding.
Dimensionality reduction was performed on this collection of binary variables with UMAP (uniform manifold approximation and projection) 65 using the following parameters: Dice metric, 25 components, 8 neighbouring sample points and a minimum distance between points of 10 −6 .Finally, unsupervised clustering was applied to the 25 resulting UMAP components with a K-means algorithm assessed and optimised with a silhouette score, revealing three clusters.Genotype status was found to be significantly associated with the clusters, using a  2 test.A post-hoc exact Fisher test was then performed to find cluster-specific enrichment.Adjustment for multiple testing was done with the Benjamini-Hochberg procedure, P < 0.05.Cluster 1 was significantly enriched with genotype-negative (NEG) subjects while cluster 3 was associated with genotype-positive (P/LP) and genotype-indeterminate (VUS) individuals.Feature importance from the initial set of participant data was assessed by applying a Kruskal-Wallis test for numerical features and a  2 test for categorical features.Significant associations (adjusted with the Benjamini-Hochberg method, P < 0.05) were further tested for cluster-specific enrichment: a Dunn test was used to test each pair of numerical features (Supplementary Fig. XIVA), while an exact Fisher test looked for one-vs-rest differences in categorical features (Supplementary Fig. XIVB).
The clustering revealed features characterising each group: 1) older female participants with lower body surface area (BSA), lower left ventricular (LV) volume, higher ejection fraction, hypertension, low activity score and on beta blockers and diuretic medications; 2) male participants with higher BSA, higher LV mass, higher LV maximum wall thickness, hypertension, moderate activity score and on medications for blood pressure (ACE/ARBs) and protective vascular (ASA/Clopi) medications; 3) younger participants with a family history of hypertrophic cardiomyopathy (HCM), no clinical cardiovascular risk factor and a high activity score.
tree from 3D end-diastolic wall thickness.a.The projection of patients' 3D end-diastolic (ED) wall thickness (WT) by the DDRTree dimensionality reduction reveal the presence of four main branches that are associated to specific morphological changes of the myocardium.Each branch is represented by the decimated ED atlas mesh, coloured accordingly to the beta coefficients resulting from testing the average difference between each branch individual and the other subjects.The yellow contour denotes the areas with a beta significantly different from zero.b.The continuous and discrete phenotypic variables found to be significantly associated to at least one branch.For left ventricular (LV) Gadolinium, labels are as follows: 1: None, 2: Minimal, 3: Moderate and 4: Severe.The significance for the enrichment of discrete variables is reported within the bars.ACE, Angiotensin-converting enzyme inhibitors; Aff, affected; ARB, Angiotensin receptor blockers; ASA, aspirin; Clopi, clopidogrel; LVOTO, Left ventricular outflow tract obstruction; SV, stroke volume.Only the significant pairs are reported with the symbols:*  ≤ 0.05; * *  ≤ 0.01; * * *  ≤ 0.001; * * * *  ≤ 0.0001,  = 436. .Selection of optimal resolution for clustering of end diastolic wall thickness.
. Selection of optimal resolution for clustering of end systolic wall thickness.
. Branch merging for ED phase.Intermediate results of the branch merging process for the ED phase data.The graph at the top-left shows the branch labels determined by 'monocle' that were progressively merged: a) short leaf branches, b) branches that did not have bifurcation into two different states.The final branches qualitatively correlate with the hierarchical structure of the tree, where the root state correspond to the center of the tree.merging for ES phase.Intermediate results of the branch merging process for the ES phase data.The graph at the top-left shows the branch labels determined by 'monocle' that were progressively merged: a) short leaf branches, b) branches that did not have bifurcation into two different states.The final branches qualitatively correlate with the hierarchical structure of the tree, where the root state correspond to the center of the tree.
positions of subjects in the original and unrelated subjects tree.

Supplementary Table II. Cluster stability at end systole.
Results from the analysis of the cluster stability for end systolic wall thickness (ES WT) WT in 1000 subsets.The possible values range between 0 and 1, where 0 means that the cluster is not stable, and 1 that the clusters are identical in all repetitions.Resolutions from 0.1 to 0.6 are determined from the clustree plot (Supplementary Fig.III).The resolution of 0.1 was chosen because it is the lowest value among those with more stable clusters.
Supplementary TableIV.Singaporean branch assignment.Singaporean HCM patients were assigned to the tree branches of their nearest neighbours in the development tree.In both end diastole (ED) and end systole (ES), most of the individuals were assigned to branches 1 and 4.

Table V. Association between ES tree coordinates and genotype status. Coefficients
of the logistic regression between the ES tree coordinates and the genotype status (genotype == "P/LP").The results for the whole cohort tree (left) and the unrelated subjects tree are consistent.1OR = Odds Ratio, CI = Confidence Interval Supplementary TableXI.Participant clinical features.MRI, magnetic resonance imaging; HCM, hypertrophic cardiomyopathy; SCD, sudden cardiac death CCS, Canadian Cardiovascular Society 66 ; NYHA, New York Heart Association 67 ; LV, left ventricular; RV, right ventricular; ACE, angiotensin-converting enzyme; ARB, angiotensin receptor blockers; ASA, acetylsalicylic acid.